The Common Voice corpus is a massively multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
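For readers unfamiliar with the metric, Character Error Rate (CER) is the Levenshtein (edit) distance between predicted and reference transcripts, normalised by the reference length. A minimal sketch in plain Python (our own helper, not part of DeepSpeech):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # dynamic-programming row for Levenshtein distance
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(character_error_rate("hello", "hallo"))  # one substitution in five chars -> 0.2
```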
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
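Since patch-based training was the most common answer to oversized samples, here is a minimal sketch of the idea (placeholder shapes and names, not tied to any specific challenge solution):

```python
import numpy as np

def random_patches(volume: np.ndarray, patch_size=(64, 64, 64), n=8, seed=0):
    """Sample random 3D patches so that large volumes fit in GPU memory."""
    rng = np.random.default_rng(seed)
    starts = [rng.integers(0, s - p + 1, size=n)
              for s, p in zip(volume.shape, patch_size)]
    return [volume[x:x + patch_size[0], y:y + patch_size[1], z:z + patch_size[2]]
            for x, y, z in zip(*starts)]

patches = random_patches(np.zeros((256, 256, 160)), n=4)
print([p.shape for p in patches])  # four (64, 64, 64) patches
```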
Because noise can interfere with downstream analysis, image denoising has come to occupy an important place in the image-processing toolbox. The most accurate state-of-the-art denoisers typically train on a representative dataset. But gathering a training set is not always feasible, so interest has grown in blind zero-shot denoisers that train only on the image they are denoising. The most accurate blind zero-shot methods are blind-spot networks, which mask pixels and attempt to infer them from their surroundings. Other methods exist in which all neurons participate in forward inference; however, they are not as accurate and are susceptible to overfitting. Here we present a hybrid approach. We first introduce a semi blind-spot network in which the network sees only a small percentage of inputs during each gradient update. We then resolve overfitting by introducing a validation scheme in which we split pixels into two groups and fill in pixel gaps using domino tilings. Our method achieves an average PSNR increase of $0.28$ dB and a threefold increase in speed over the current gold-standard blind zero-shot denoiser, Self2Self, on synthetic Gaussian noise. We demonstrate the broader applicability of Pixel Domino Tiling by inserting it into a previously published method.
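A minimal sketch of the blind-spot principle the method builds on: hide a subset of pixels from the network and compute the loss only at those positions (PyTorch; the mask ratio and the average-pooled filler are our simplifications, not the paper's exact scheme):

```python
import torch
import torch.nn.functional as F

def blind_spot_step(net, img, mask_ratio=0.05):
    """One gradient step: hide a few pixels, train to infer them from context."""
    mask = (torch.rand_like(img) < mask_ratio).float()
    filler = F.avg_pool2d(img, 3, stride=1, padding=1)  # crude neighbour estimate
    pred = net(img * (1 - mask) + filler * mask)
    # The loss is evaluated only where pixels were hidden from the network.
    return ((pred - img) ** 2 * mask).sum() / mask.sum().clamp(min=1)
```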
Federated learning is a collaborative method that aims to preserve data privacy while creating AI models. Current approaches to federated learning tend to rely heavily on secure aggregation protocols to preserve data privacy. However, to some degree, such protocols assume that the entity orchestrating the federated learning process (i.e., the server) is not fully malicious or dishonest. We investigate vulnerabilities in secure aggregation that could arise if the server is fully malicious and attempts to obtain access to private, potentially sensitive data. Furthermore, we provide a method to further defend against such a malicious server, and demonstrate its effectiveness against known attacks that reconstruct data in a federated learning setting.
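For context, the core idea of standard secure aggregation is that clients add pairwise masks that cancel in the sum, so an honest-but-curious server sees only the aggregate. A toy sketch (no key agreement or dropout handling, which real protocols require and which a fully malicious server can try to subvert):

```python
import numpy as np

updates = {c: np.random.default_rng(ord(c)).normal(size=4) for c in "abc"}
pair_seed = lambda u, v: (min(ord(u), ord(v)) << 8) | max(ord(u), ord(v))

def masked_update(client):
    """Add +mask towards higher-id peers and -mask towards lower-id peers."""
    out = updates[client].copy()
    for peer in updates:
        if peer != client:
            # Both endpoints of a pair derive the same mask from a shared seed.
            mask = np.random.default_rng(pair_seed(client, peer)).normal(size=4)
            out += mask if client < peer else -mask
    return out

aggregate = sum(masked_update(c) for c in updates)
assert np.allclose(aggregate, sum(updates.values()))  # masks cancel in the sum
```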
To date, the best-performing blind super-resolution (SR) techniques follow one of two paradigms: (A) generate and train a standard SR network on synthetic low-resolution/high-resolution (LR/HR) pairs, or (B) attempt to predict the degradations an LR image has suffered and use these to inform a customised SR network. Despite significant progress, subscribers to the former miss out on useful degradation information that could be used to improve the SR process. On the other hand, followers of the latter rely on weaker SR networks, which are significantly outperformed by the latest architectural advancements. In this work, we present a framework for combining any blind SR prediction mechanism with any deep SR network, using a metadata insertion block to insert prediction vectors into SR network feature maps. Through comprehensive testing, we prove that state-of-the-art contrastive and iterative prediction schemes can be successfully combined with high-performance SR networks such as RCAN and HAN within our framework. We show that our hybrid models consistently achieve stronger SR performance than both their non-blind and blind counterparts. Furthermore, we demonstrate our framework's robustness by predicting degradations and super-resolving images from a complex pipeline of blurring, noise and compression.
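One plausible shape for such a metadata insertion block (our illustrative sketch, not necessarily the paper's exact design): project the prediction vector to per-channel scale and shift parameters and modulate the feature maps:

```python
import torch
import torch.nn as nn

class MetadataInsertion(nn.Module):
    """Fuse a degradation-prediction vector into SR feature maps."""
    def __init__(self, meta_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(meta_dim, channels), nn.ReLU(),
            nn.Linear(channels, 2 * channels),  # per-channel scale and shift
        )

    def forward(self, feats: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        scale, shift = self.proj(meta)[:, :, None, None].chunk(2, dim=1)
        return feats * (1 + scale) + shift

feats = torch.randn(2, 64, 32, 32)  # feature maps inside an SR network
meta = torch.randn(2, 10)           # predicted degradation vector
print(MetadataInsertion(10, 64)(feats, meta).shape)  # torch.Size([2, 64, 32, 32])
```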
Quantification of the fat depots surrounding the heart is an accurate procedure for assessing health risk factors associated with several diseases. However, this type of assessment is not widely used in clinical practice due to the human workload involved. This work proposes a novel technique for the automatic segmentation of cardiac fat pads. The technique is based on applying classification algorithms to the segmentation of cardiac CT images. Furthermore, we extensively evaluate the performance of several algorithms on this task and discuss which ones provide the better predictive models. Experimental results show a mean accuracy of 98.4% for the classification of epicardial and mediastinal fat, with a mean true positive rate of 96.2%. On average, the Dice similarity index between the segmented patients and the ground truth is equal to 96.8%. Ours is thus, to date, the most accurate reported result for the automatic segmentation of cardiac fat.
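For reference, the Dice similarity index reported above is twice the overlap between two masks divided by the sum of their sizes. A minimal sketch:

```python
import numpy as np

def dice_index(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_index(pred, truth))  # 2*2 / (3+3) = 0.667
```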
Vestibular schwannomas (VS) typically grow from the inner ear towards the brain. A VS can be divided into two regions, corresponding to the parts inside and outside the internal auditory canal. Growth in the extrameatal region is a key factor in determining the disease management followed by clinicians. In this work, a VS segmentation approach that splits the segmentation into intrameatal and extrameatal parts is proposed. We annotated a dataset consisting of 227 T2 MRI instances, acquired longitudinally from 137 patients, excluding post-operative instances. We propose a staged approach, in which the first stage performs whole-tumour segmentation and the second stage performs intrameatal/extrameatal segmentation using the T2 MRI along with the mask obtained from the first stage. To improve the accuracy of the predicted meatal boundary, we introduce a task-specific loss that we call the boundary distance loss. Performance is evaluated against a direct intrameatal/extrameatal segmentation task (i.e., the baseline). Our proposed method, with the two-stage approach and the boundary distance loss, achieved Dice scores of 0.8279 ± 0.2050 and 0.7744 ± 0.1352 for the extrameatal and intrameatal regions respectively, significantly improving on the baseline, which gave Dice scores of 0.7939 ± 0.2325 and 0.7475 ± 0.1346 for the extrameatal and intrameatal regions respectively.
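A hedged sketch of one common way to realise a boundary-distance-style loss (not necessarily the paper's exact formulation): weight per-voxel errors by the distance transform of the ground-truth mask, so mistakes far from the true meatal boundary are penalised more heavily:

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def boundary_distance_loss(pred_prob: torch.Tensor, gt_mask: np.ndarray) -> torch.Tensor:
    """Weight segmentation errors by their distance to the GT boundary."""
    # Unsigned distance to the mask boundary, computed on both sides.
    dist = distance_transform_edt(1 - gt_mask) + distance_transform_edt(gt_mask)
    weight = torch.from_numpy(dist).float()
    gt = torch.from_numpy(gt_mask).float()
    return ((pred_prob - gt).abs() * weight).mean()
```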
Inter-patient abdominal registration has various applications, from pharmacological studies to anatomical modelling. However, it remains a challenging application due to the morphological heterogeneity and variability of the human abdomen. Among the various registration methods proposed for this task, probabilistic displacement registration models estimate displacement distributions for a subset of points by comparing the feature vectors of points in the two images. These probabilistic models are informative and robust, while allowing large displacements by design. Since the displacement distributions are typically estimated on a subset of points (which we refer to as driving points) to meet computational requirements, we propose in this work to learn a driving point predictor. Compared with previously proposed methods, the driving point predictor is optimised in an end-to-end fashion to infer driving points tailored to a specific registration pipeline. We evaluate the impact of our contribution on two different datasets corresponding to different modalities. Specifically, we compare the performance of 6 different probabilistic displacement registration models when using the driving point predictor or one of 2 other standard driving point selection methods. The proposed method improves performance in 11 out of 12 experiments.
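One way such a driving point predictor might look (our illustrative sketch, not the paper's architecture): a small network scores every voxel of a feature volume and the top-k scoring locations are kept as driving points:

```python
import torch
import torch.nn as nn

class DrivingPointPredictor(nn.Module):
    """Score candidate locations and keep the top-k as driving points."""
    def __init__(self, feat_channels: int):
        super().__init__()
        self.score = nn.Conv3d(feat_channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor, k: int = 128) -> torch.Tensor:
        # feats: (1, C, D, H, W) image features; returns k voxel coordinates.
        top_idx = self.score(feats).flatten().topk(k).indices
        d, h, w = feats.shape[2:]
        return torch.stack([top_idx // (h * w),
                            (top_idx // w) % h,
                            top_idx % w], dim=1)  # (k, 3)

pts = DrivingPointPredictor(16)(torch.randn(1, 16, 24, 24, 24))
print(pts.shape)  # torch.Size([128, 3])
```

Note that hard top-k selection is itself non-differentiable; an end-to-end pipeline as described would need a relaxation or a loss propagated through the downstream use of the selected points.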
Recent self-supervised methods have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images paired with a long text narrative, such as describing a news article with a visual summary. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and the number of images. In addition, unlike prior work which assumed captions, we assume that images contain only loose illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset containing 31M articles, 22M images and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images. Finally, we introduce an intuitive baseline that outperforms these methods by over 10% on zero-shot article retrieval on the GoodNews dataset.
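A minimal sketch of the kind of zero-shot retrieval evaluated here (our own illustrative baseline, not the paper's): embed the article text and each candidate's images, average the image embeddings per article, and rank candidates by cosine similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(text_emb: torch.Tensor, image_embs: list) -> int:
    """Rank candidate articles by cosine similarity between the text
    embedding and the mean of each article's image embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    scores = torch.stack([
        (text_emb * F.normalize(imgs.mean(dim=0), dim=-1)).sum()
        for imgs in image_embs  # one (n_images, dim) tensor per article
    ])
    return int(scores.argmax())

query = torch.randn(512)                                # text embedding
candidates = [torch.randn(3, 512) for _ in range(10)]   # image embeddings per article
print(retrieve(query, candidates))
```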
This paper introduces new multiplicative updates for nonnegative matrix factorization with the $\beta$-divergence and sparse regularization of one of the two factors (say, the activation matrix). It is well known that the norm of the other factor (the dictionary matrix) needs to be controlled in order to avoid an ill-posed formulation. Standard practice consists in constraining the columns of the dictionary to have unit norm, which leads to a nontrivial optimization problem. Our approach leverages a reparametrization of the original problem into the optimization of an equivalent scale-invariant objective function. From there, we derive block-descent majorization-minimization algorithms that yield simple multiplicative updates for either $\ell_{1}$-regularization or the more "aggressive" log-regularization. In contrast with other state-of-the-art methods, our algorithms are universal in the sense that they can be applied to any $\beta$-divergence (i.e., any value of $\beta$) and that they come with convergence guarantees. We report numerical comparisons with existing heuristic and Lagrangian methods using various datasets: face images, an audio spectrogram, hyperspectral data and song play counts. We show that our methods obtain solutions of similar quality at convergence (similar objective values) but with significantly reduced CPU times.
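For context, the classic heuristic multiplicative updates for $\beta$-divergence NMF with an $\ell_{1}$ penalty on the activations $H$ look as follows; this is the kind of baseline the paper compares against, not its scale-invariant scheme (which additionally controls the dictionary norm):

```python
import numpy as np

def nmf_beta_l1(V, rank=5, beta=1.0, lam=0.1, n_iter=200, seed=0):
    """Heuristic multiplicative updates for beta-divergence NMF with an
    l1 penalty on H (classic baseline, no dictionary norm control)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (WH ** (beta - 2) * V)) / (W.T @ WH ** (beta - 1) + lam + eps)
        WH = W @ H + eps
        W *= ((WH ** (beta - 2) * V) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((64, 48))
W, H = nmf_beta_l1(V, beta=1.0)  # beta = 1 corresponds to the KL divergence
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative residual
```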